
[VL] Adding configurations on max write file size#11606

Open
zhouyuan wants to merge 2 commits into apache:main from zhouyuan:wip_config_writer

Conversation

@zhouyuan
Member

What changes are proposed in this pull request?

Adding config for max write file size in Velox

How was this patch tested?

Pass GHA
Velox UT

Was this patch authored or co-authored using generative AI tooling?

Signed-off-by: Yuan <yuanzhou@apache.org>
  .createWithDefault(10000)

val MAX_TARGET_FILE_SIZE_SESSION =
  buildConf("spark.gluten.sql.columnar.backend.velox.maxTargetFileSizeSession")
Contributor

What does Session mean here?

| Config | Default | Description |
| --- | --- | --- |
| spark.gluten.sql.columnar.backend.velox.maxSpillFileSize | 1GB | The maximum size of a single spill file created. |
| spark.gluten.sql.columnar.backend.velox.maxSpillLevel | 4 | The maximum allowed spilling level, with zero being the initial spilling level. |
| spark.gluten.sql.columnar.backend.velox.maxSpillRunRows | 3M | The maximum number of rows in a single spill run. |
| spark.gluten.sql.columnar.backend.velox.maxTargetFileSizeSession | 0b | The target file size for each output file when writing data. 0 means no limit on target file size; the actual file size is then determined by other factors such as the max partition number and the shuffle batch size. |
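The size defaults above ("1GB", "0b") are byte-size strings, with 0 meaning no limit. A minimal sketch of how such a string could map to bytes (a hypothetical helper for illustration only; Gluten and Spark use their own byte-string parsing, e.g. Spark's `JavaUtils.byteStringAsBytes`):

```java
// Hypothetical size-string parser for illustration; not Gluten's actual code.
public class SizeParser {
    static long toBytes(String s) {
        String v = s.trim().toLowerCase();
        long mult = 1L;
        if (v.endsWith("gb") || v.endsWith("g")) mult = 1L << 30;
        else if (v.endsWith("mb") || v.endsWith("m")) mult = 1L << 20;
        else if (v.endsWith("kb") || v.endsWith("k")) mult = 1L << 10;
        // Strip the trailing unit letters, keep the numeric part.
        String digits = v.replaceAll("[a-z]+$", "");
        return Long.parseLong(digits) * mult;
    }

    public static void main(String[] args) {
        System.out.println(toBytes("1GB")); // maxSpillFileSize default
        System.out.println(toBytes("0b"));  // 0 means no target file size limit
    }
}
```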
Contributor
@FelixYBW Feb 12, 2026

Does it map to Iceberg's write.target-file-size-bytes? And does it honor spark.sql.iceberg.advisory-partition-size? If so, let's honor that config in Gluten as well.

If it only takes effect on Iceberg, we may just reuse Iceberg's config instead of adding a new one.

Member Author

As of today Velox uses kMaxTargetFileSize to control the Parquet write file size, so it affects all Parquet writes. In the current Iceberg write code path, this parameter is picked up to control the Parquet file size within each partition.

Member Author

In iceberg java the logic for partition control is:
https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkWriteConf.java#L695-L702

        .option(SparkWriteOptions.ADVISORY_PARTITION_SIZE)
        .sessionConf(SparkSQLProperties.ADVISORY_PARTITION_SIZE)
        .tableProperty(TableProperties.SPARK_WRITE_ADVISORY_PARTITION_SIZE_BYTES)
        .defaultValue(defaultValue)

Precedence: write options > session conf > table property > default.
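That precedence chain can be sketched as a small resolver (a hypothetical illustration of the ConfParser-style fallback, not Iceberg's actual implementation; the key names below are assumptions for the sketch):

```java
import java.util.Map;
import java.util.Optional;

// Sketch of Iceberg-style config resolution:
// write option > session conf > table property > default value.
public class ConfResolution {
    static long resolveAdvisoryPartitionSize(
            Map<String, String> writeOptions,
            Map<String, String> sessionConf,
            Map<String, String> tableProps,
            long defaultValue) {
        // Key names here are illustrative, not Iceberg's exact constants.
        return Optional.ofNullable(writeOptions.get("advisory-partition-size"))
                .or(() -> Optional.ofNullable(
                        sessionConf.get("spark.sql.iceberg.advisory-partition-size")))
                .or(() -> Optional.ofNullable(
                        tableProps.get("write.spark.advisory-partition-size-bytes")))
                .map(Long::parseLong)
                .orElse(defaultValue);
    }

    public static void main(String[] args) {
        // No write option given: the session conf wins over the table property.
        long size = resolveAdvisoryPartitionSize(
                Map.of(),
                Map.of("spark.sql.iceberg.advisory-partition-size", "134217728"),
                Map.of("write.spark.advisory-partition-size-bytes", "67108864"),
                33554432L);
        System.out.println(size);
    }
}
```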


  .createWithDefault(10000)

val MAX_TARGET_FILE_SIZE_SESSION =
  buildConf("spark.gluten.sql.columnar.backend.velox.maxTargetFileSizeSession")
Contributor

Please remove the Session suffix; it is Velox's config-type suffix in the code, not part of the config name itself.

Member Author

done

Signed-off-by: Yuan <yuanzhou@apache.org>

3 participants